Abstract — The probabilistic threshold query is one of the most common queries in uncertain databases, where a result
satisfying the query must also have a probability that meets the threshold requirement. In this paper, we investigate
probabilistic threshold keyword queries (PrTKQ) over XML data, which have not been studied before. We first introduce the
notion of quasi-SLCA and use it to represent results for a PrTKQ under possible world semantics. Then we design a
probabilistic inverted (PI) index that can be used to quickly return the qualified answers and filter out the unqualified
ones based on our proposed lower/upper bounds. After that, we propose two efficient and comparable algorithms: a Baseline
Algorithm and a PI index-based Algorithm. To accelerate the algorithms, we also utilize a probability density function. An
empirical study using real and synthetic data sets has verified the effectiveness and the efficiency of our approaches.

Index Terms — Probabilistic XML, Threshold Keyword Query, Probabilistic Index. 



1 Introduction 

Uncertainty is widespread in many web applications,
such as information extraction, information integration,
and web data mining. In uncertain databases,
probabilistic threshold queries have been studied extensively,
where all results satisfying the queries with
probabilities equal to or larger than the given threshold
values are returned [4]. However,
all of these works were based on the uncertain
relational data model. Because the flexibility of the XML
data model allows a natural representation of uncertain
data, uncertain XML data management has
become an important issue and much work has
been done recently. For example, many probabilistic
XML data models were designed and analyzed.
Based on different data models, query
evaluation, algebraic manipulation [8],
and updates [6], [10] were studied. However,
most of these works concentrated on structured query
processing, e.g., twig queries. In this paper, we propose
and address a new, interesting and challenging
problem of Probabilistic Threshold Keyword Query
(PrTKQ) over uncertain XML databases based on
quasi-SLCA semantics, which has not been studied before as
far as we know.

In general, an XML document could be viewed as
a rooted tree, where each node represents an element
or contents. XIRQL [14] supports keyword search in
XML based on structured queries. However, users
may not know the structure of the XML
data or the query language. As such, supporting
pure keyword search in XML has attracted extensive
research. The LCA-based approaches first identify the
LCA nodes, each of which contains every keyword under
its subtree at least once. Since the LCA nodes are sometimes not very
specific to users' queries, Xu and Papakonstantinou
[20] proposed the concept of SLCA (smallest lowest
common ancestor), where a node v is regarded as an
SLCA if (a) the subtree rooted at the node v, denoted
as $T_{sub}(v)$, contains all the keywords, and (b) there
does not exist a descendant node v' of v such that
$T_{sub}(v')$ contains all the keywords. In other words, if
a node is an SLCA, then its ancestors will be definitely
excluded from being SLCAs. The SLCA semantics for
modeling keyword search results on a deterministic XML
tree are also applied in [22], [16], [19].

Based on the SLCA semantics, [23] discussed top-k
keyword search over a probabilistic XML document.
Given a keyword query q and a probabilistic XML
document (PrXML), [23] returned the top k most
relevant SLCA results (PrSLCAs) based on their probabilities.
Different from the SLCA semantics over deterministic
XML documents, a node v being a PrSLCA
can only exclude its ancestors from being PrSLCAs
by a probability. This probability can be calculated
by aggregating the probabilities of the deterministic
documents (called possible worlds) W implied in the
PrXML where v is an SLCA in each deterministic
document $w \in W$.

However, it is not suitable to directly utilize the 
PrSLCA semantics for evaluating PrTKQs because the 
PrSLCA semantics are too strong. In some applications,
users tend to be confident with the results to
be searched, so relatively high probability threshold 
values may be given. Consequently, it is very likely 
that no qualified PrSLCA results will be returned. To 
solve this problem, we propose and utilize a so-called 
quasi-SLCA semantics to define the results of a PrTKQ 
by relaxing the semantics of PrSLCA with regards to 
a given threshold value, i.e., besides the probability 
of v being a PrSLCA in PrXML, the probability of a 
node v being a quasi-SLCA in PrXML may also count 
the probability of v's descendants being PrSLCAs in 
PrXML if their probabilities are below the specified 
threshold value. In other words, a node v being a 
quasi-SLCA will exclude its ancestors from being 
quasi-SLCAs by a probability only when this prob- 
ability is no less than the given threshold; otherwise, 
this probability will be included for contributing to its 
ancestors. This is different from the PrSLCA semantics 
that excludes the probability contribution from child
nodes.

Fig. 1. A probabilistic XML data tree



Example 1: Consider an aircraft-monitored battlefield
application, where the useful information is
captured as aerial photographs. By analysing
the photographs, we can extract the possible objects
(e.g., road, factory, airport, etc.) and attach some text
description to them with probabilities, which can be
stored in the format of PrXML. Figure 1 is a snapshot
of aircraft-monitored battlefield XML data. By issuing
a keyword query {hazard, building}, a military
department would find the potential areas containing
hazard buildings above a probability threshold.

Based on the semantics of PrSLCA, any of the nodes
library (probability = 0.3), area1 (= 0.14), sub_region1
(= 0.168), heliport (= 0.24), sub_region2 (= 0.32) and
region (= 0.088) can become a PrSLCA result. The
detailed procedure of calculating the probabilities of
results will be shown later. Users
generally specify a threshold value σ as the confidence
score with their issued query, e.g., σ = 0.40 means
that the users prefer to see the answers with
probabilities of at least 0.40. In this condition, no results
can be returned to the users.

However, from Figure 1 we can see that if the
probabilities of library and area2 could contribute to
their parent nodes, area1 and sub_region2 would become
quasi-SLCA results. Unfortunately, the PrSLCA
semantics exclude them from being results. This motivates
us to relax the PrSLCA semantics to the quasi-SLCA
semantics. According to the quasi-SLCA semantics,
the probabilities of area1 and sub_region2
being the quasi-SLCA results are 0.44 and 0.56 with
the contributions of their child nodes library and
area2, respectively. As such, area1 and sub_region2
are deemed as the interesting places to be returned.

Given a PrTKQ, our problem is to quickly compute
all the quasi-SLCA nodes with their probabilities
meeting the threshold requirement. Users issuing
PrTKQs generally expect to see the complete
quasi-SLCA answer set as early as possible and do not
need to know the accurate probability of each answer,
which motivates us to design a Probabilistic Inverted
(PI) index and an efficient PI-based algorithm for quickly
identifying quasi-SLCA result candidates.

We summarize the contributions of this paper as 
follows: 

• Based on our proposed quasi-SLCA result definition,
we study probabilistic threshold keyword
query over uncertain XML data, which satisfies
the possible world semantics. To the best of our
knowledge, this problem has not been studied
before.

• We design a probabilistic inverted (PI) index
that can quickly compute the lower bound and
upper bound for a threshold keyword query, by
which many unqualified nodes can be pruned
and qualified nodes can be returned as early as
possible. To keep the effectiveness of pruning, the
probability density function is employed based
on the assumption of Gaussian distribution.

• We propose two algorithms, a comparable baseline
algorithm and a PI-based algorithm, to efficiently
find all the quasi-SLCA results meeting
the threshold requirement.

• Experimental evaluation has demonstrated the
efficiency and effectiveness of the proposed approaches.

The rest of this paper is organized as follows. In
Section 2, we introduce the probabilistic XML model
and the problem definition of probabilistic threshold
keyword query. Section 3 shows the procedure of efficiently
finding quasi-SLCA results using an example.
Section 4 first presents the data structure of the PI index,
discusses the basic building operations and pruning
techniques of the PI index, and provides the building
algorithm of the PI index. In Section 5, we propose a comparable
baseline algorithm and a PI-based algorithm
to find the qualified quasi-SLCA results. We report the
experimental results in Section 6. Section 7 discusses
related works and Section 8 concludes the paper.

2 Probabilistic Data Model and 
Problem Definition 

Probabilistic Data Model: A PrXML document de- 
fines a probability distribution over a space of de- 
terministic XML documents. Each deterministic doc- 
ument belonging to this space is called a possible 
world. A PrXML document represented as a labelled 
tree has ordinary and distributional nodes. Ordinary 
nodes are regular XML nodes and they may appear in 
deterministic documents, while distributional nodes 
are only used for defining the probabilistic process of 
generating deterministic documents and they do not 
occur in those documents. 

In this paper, we adopt a popular probabilistic XML
model, $PrXML^{\{ind,mux\}}$ [23], which was first
discussed in [7]. In this model, a PrXML document
is considered as a labelled tree where distributional
nodes have two types, IND and MUX. An IND node
has children that are independent of each other, while
the children of a MUX node are mutually exclusive, that
is, at most one child can exist in a random instance
document (called a possible world). A real number
from (0,1] is attached to each edge in the XML tree,
indicating the conditional probability that the child
node will appear under the parent node given the
existence of the parent node. An example of a PrXML
document is given in Fig. 1. Unweighted edges have
1 as the default conditional probability.

The Semantics of PrSLCA in PrXML: According to
the semantics of possible worlds, the global probability
of a node v being a PrSLCA with regard to a given
query q in the possible worlds is defined as follows:

$$Pr^{G}_{slca}(q, v) = \sum_{i=1}^{m} \{ Pr(w_i) \mid slca(q, v, w_i) = true \} \qquad (1)$$

where $w_1, \ldots, w_m$ denote the possible worlds implied
by the PrXML document; $slca(q, v, w_i) = true$ indicates that v is an SLCA
in the possible world $w_i$ for the query q. $Pr(w_i)$ is
the existence probability of the possible world $w_i$. The
symbol G means that $Pr^{G}_{slca}(q, v)$ is the global probability of
a node v being an SLCA w.r.t. q in all possible worlds.

Example 2: Consider a small PrXML in
Figure 2(a) and all generated possible worlds
in Figure 2(b)-(i), where a solid line
represents the existence of the edge while a dashed
line represents the absence of the edge. Given a
possible world, we can compute its global probability
based on the existence/absence of the edges in the
possible world, e.g., $Pr(w_d) = (1 - 0.5) \times 0.3 \times 0.4 = 0.06$.

Given a keyword query $q = \{k_1, k_2\}$, we can compute
the global probability of $c_2$ being a PrSLCA w.r.t.
q by using $Pr^{G}_{slca}(q, c_2) = Pr(w_b) + Pr(w_d) + Pr(w_f) +
Pr(w_h) = 0.06 + 0.06 + 0.09 + 0.09 = 0.30$. Similarly,
we have the global probability of $a_4$ being a PrSLCA
w.r.t. q by using $Pr^{G}_{slca}(q, a_4) = Pr(w_e) = 0.14$.
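
The numbers in Example 2 can be checked by brute-force enumeration of the possible worlds. The sketch below is only an illustration under stated assumptions: the subtree is taken to be a_4 with three independent children c_1, c_2, c_3 carrying edge probabilities 0.5, 0.3 and 0.4, and the keyword placement (c_1 holds k_1, c_2 holds both keywords, c_3 holds k_2) is inferred from Examples 2 and 5 rather than read from Figure 2. The names `CHILDREN` and `prslca_by_enumeration` are hypothetical.

```python
from itertools import product

# Assumed layout of the small PrXML: child -> (edge probability, keywords held).
CHILDREN = {
    "c1": (0.5, {"k1"}),
    "c2": (0.3, {"k1", "k2"}),
    "c3": (0.4, {"k2"}),
}
QUERY = {"k1", "k2"}


def prslca_by_enumeration(children=CHILDREN, query=QUERY, root="a4"):
    """Sum Pr(w) over all possible worlds w in which each node is an SLCA."""
    names = list(children)
    result = {root: 0.0, **{name: 0.0 for name in names}}
    for present in product([True, False], repeat=len(names)):
        # World probability: p for an existing edge, (1 - p) for an absent one.
        pr = 1.0
        for name, here in zip(names, present):
            p = children[name][0]
            pr *= p if here else 1.0 - p
        alive = [name for name, here in zip(names, present) if here]
        # A child that covers the whole query is an SLCA and excludes the root.
        child_slcas = [name for name in alive if query <= children[name][1]]
        if child_slcas:
            for name in child_slcas:
                result[name] += pr
        else:
            covered = set().union(*(children[name][1] for name in alive))
            if query <= covered:
                result[root] += pr
    return result
```

With these assumptions the enumeration gives about 0.30 for c_2 and 0.14 for a_4, matching Example 2. The cost is exponential in the number of edges, which is why the paper avoids generating possible worlds.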

The Semantics of quasi-SLCA in PrXML: 

Definition 1: Quasi-SLCA: Given a keyword query
q and a threshold value σ, a node v is called a quasi-SLCA
if and only if (1) v or its descendants are SLCAs
in a set W of possible worlds; (2) the aggregated
probability of v and its descendants to be SLCAs in W
is no less than σ; (3) no descendant nodes of v satisfy
both of the above conditions in any set of possible
worlds that overlaps with W.

In other words, if a descendant node $v_d$ of v is
a quasi-SLCA, then the probability of $v_d$ has to be
excluded from the probability of v being a quasi-SLCA.
It means that the set of possible worlds in which
$v_d$ appears does not overlap with the set of possible
worlds in which v or its other descendants appear.

Given a query q, we can compute $Pr^{L}_{slca}(q, v)$ in
a bottom-up manner, where $Pr^{L}_{slca}(q, v)$ stands for
the local probability of v being an SLCA in the
probabilistic subtree rooted at v. For example, $a_4$ in
Figure 2(a) is a subtree of Figure 3; $Pr^{L}_{slca}(q, a_4)$ can
be used to compute the PrSLCA probability of $a_2$ and
$a_1$. From $Pr^{L}_{slca}(q, v)$, we can easily get $Pr^{G}_{slca}(q, v)$
by $Pr^{G}_{slca}(q, v) = Pr(path_{r \to v}) \times Pr^{L}_{slca}(q, v)$, where
$Pr(path_{r \to v})$ indicates the existence probability of v
in the possible worlds. It can be computed by multiplying
the conditional probabilities along the path
from the root r to v.

Now, we define quasi-SLCA based on PrSLCA and
the parent-child relationship. For an IND node v, we
have:

$$Pr^{G}_{quasi-slca}(q, v) = Pr^{G}_{slca}(q, v) + Pr(path_{r \to v}) \times \Big(1 - \prod_{v' \in child(v) \wedge v' \notin V_{quasi}} \big(1 - Pr^{L}_{slca}(q, v')\big)\Big) \qquad (2)$$

where the child node v' of v is an SLCA node, but not
a quasi-SLCA node.

For a MUX node v, we have:

$$Pr^{G}_{quasi-slca}(q, v) = Pr^{G}_{slca}(q, v) + Pr(path_{r \to v}) \times \sum \{ Pr^{L}_{slca}(q, v') \mid v' \in child(v) \wedge v' \notin V_{quasi} \} \qquad (3)$$

Note, IND or MUX nodes are normally not allowed
to be SLCA result nodes because they are only distributional
nodes. As such, for the above IND or MUX
node v, we may use its parent node $v_p$ (with v as a
sole child) to represent the SLCA result node.

Example 3: Let's consider Example 2 again. First
assume the specified threshold value is 0.40. Then
the global probability of $a_4$ being a quasi-SLCA result
can be calculated by using $Pr^{G}_{quasi-slca}(q, a_4) =
Pr^{G}_{slca}(q, a_4) + Pr(path_{r \to a_4}) \times (1 - (1 - Pr^{L}_{slca}(q, c_2)))
= 0.14 + 0.30 = 0.44$, because child $c_2$ is an SLCA node
but not a quasi-SLCA node w.r.t. the given threshold.
So $c_2$'s SLCA probability contributes to its parent
node $a_4$. If the threshold is decreased to 0.30, then
$c_2$ will be taken as a qualified quasi-SLCA result and
will not contribute to $a_4$. In this case, $a_4$ cannot become
a quasi-SLCA result because $Pr^{G}_{quasi-slca}(q, a_4)
= Pr^{G}_{slca}(q, a_4) = 0.14 < 0.30$. If the threshold is further
decreased to 0.14, both $c_2$ and $a_4$ are qualified quasi-SLCA
results.
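
Equations (2) and (3) can be written as two small functions. This is a minimal sketch, not the paper's algorithm: it assumes the caller already knows each child's local SLCA probability and whether that child is itself a quasi-SLCA (in which case it contributes nothing). The function and parameter names are hypothetical.

```python
from math import prod


def quasi_slca_ind(global_slca, path_prob, child_local_slca, child_is_quasi):
    """Equation (2): children under an IND node contribute independently."""
    contributing = [
        p for p, is_quasi in zip(child_local_slca, child_is_quasi) if not is_quasi
    ]
    # prod() over an empty list is 1, so no contributing child adds nothing.
    return global_slca + path_prob * (1.0 - prod(1.0 - p for p in contributing))


def quasi_slca_mux(global_slca, path_prob, child_local_slca, child_is_quasi):
    """Equation (3): children under a MUX node are mutually exclusive, so sum."""
    contributing = [
        p for p, is_quasi in zip(child_local_slca, child_is_quasi) if not is_quasi
    ]
    return global_slca + path_prob * sum(contributing)
```

For Example 3 with threshold 0.40, `quasi_slca_ind(0.14, 1.0, [0.30], [False])` gives about 0.44; once the threshold drops to 0.30 and c_2 becomes a quasi-SLCA itself, passing `[True]` leaves a_4 at 0.14.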

Definition 2: Probabilistic Threshold Keyword
Query (PrTKQ): Given a keyword query q and a
threshold σ, the result of q over probabilistic
XML data T is a set R of quasi-SLCA nodes with
their probabilities equal to or larger than σ, i.e.,
$Pr^{G}_{quasi-slca}(q, v) \geq \sigma$ for $\forall v \in R$.
In this work, we are interested in how to efficiently
compute the quasi-SLCA answer set for a PrTKQ over
probabilistic XML data.

3 Overview of this Work 

A naive method to answer a PrTKQ is to enumerate
all possible worlds and apply the query to each
possible world. Then, we can compute the overall
probability of each quasi-SLCA result and return the
results meeting the probability threshold. However,
the naive method is inefficient due to the huge number
of possible worlds over probabilistic XML data.
Another method is to extend the work in [23] to
compute the probabilities of quasi-SLCA candidates.
Although it is much more efficient than the naive
method, it needs to scan the keyword node lists and
calculate the keyword distributions for all relevant
nodes. This motivates our development of
efficient algorithms that not only avoid generating
possible worlds, but also prune more unqualified
nodes.

To accelerate query evaluation, in this paper we
propose a prune-based probabilistic threshold keyword
query algorithm, which determines the qualified
results and filters the unqualified candidates by using
off-line computed probability information. To do this,
we need to first calculate the probability of each
possible query term within a node, which is stored
as an off-line computed probabilistic index. Within
a node, any two of its contained terms may appear
in the IND or MUX ways. To precisely differentiate
IND and MUX, we utilize different parts to represent
the probabilities of possible query terms appearing
in the MUX way, while the terms in each part hold IND
relationships. In other words, the different parts of
terms in a node are mutually exclusive (MUX), e.g., $a_3$
and $a_5$ in Figure 3 each consist of three parts.

Given a keyword query and a threshold value,
we first load the corresponding off-line computed
probabilistic index w.r.t. the keyword query and then
calculate on-the-fly the range of probabilities of a node
being a result of the keyword query using the pre-computed
probabilistic index in a bottom-up strategy.
Here, the range of probabilities can be represented by
two boundary values: lower bound and upper bound.
By comparing the lower/upper bounds of candidates,
the qualified results can be efficiently identified.

The following two examples briefly demonstrate
how we calculate the lower/upper bounds based on
a given keyword query and the off-line computed
probabilistic index, and how we apply the on-line
computed lower/upper bounds to prune the unqualified
candidates and determine the qualified ones.

Fig. 3. PI index and Lower/Upper Bound for a query
$\{k_1, k_2\}$ over the given PrXML

Figure 3 shows the lower/upper bounds of each
node in Figure 1, where the probability of each
individual term is calculated offline while the
lower/upper bounds are computed on-the-fly based
on the given query keywords. Let's first introduce the
related concepts briefly: the probability of a term in
a node represents the total local probability of the
term appearing in all possible worlds to be generated
for the probabilistic subtree rooted at the node, e.g.,
$Pr(k_1, a_2) = 0.65$ and $Pr(k_2, a_2) = 0.916$; the lower
bound value represents the minimal total local probability
of the given query keywords appearing in all
the possible worlds w.r.t. the probabilistic subtree, e.g.,
$LB(k_1 k_2, a_2) = 0.65 \times 0.916 = 0.595$; the upper bound value
represents the maximal total local probability of the
given query keywords appearing in all the possible
worlds w.r.t. the probabilistic subtree because the keywords
may be independent or co-occur, e.g., $UB(k_1 k_2,
a_2) = \min\{0.65, 0.916\} = 0.65$ no matter whether they
are independent. By multiplying the path probability,
the local probability can be transformed into the
global probability. For the nodes containing MUX
semantics, we group the probabilities of their terms
into different parts, any two of which are mutually
exclusive, as shown in $a_1$ and $a_5$ in Figure 3. The
details of computing the lower/upper bounds for the
IND and MUX semantics are given in the following section.

Example 4: Consider a PrTKQ $\{k_1, k_2\}$ with σ = 0.40
again. $a_5$, $c_2$ and $c_7$ can be pruned directly without
calculation because their upper bounds are all lower
than 0.40. We need to check the remaining nodes $a_1$, $a_2$,
$a_3$ and $a_4$. For $a_4$, after computation, the probability
of $a_4$ being a quasi-SLCA result is 0.44, which
is larger than the specified threshold value 0.40, so
$a_4$ will be taken as a result. After that, the result
of $a_4$ can be used to update the lower bound and
upper bound of $a_2$: (LB=0.595, UB=0.65) → (LB=0.155,
UB=0.21). As a consequence, $a_2$ should be filtered
because $UB(a_2) = 0.21 < \sigma = 0.40$. Similarly, $a_3$
can be computed and selected as a result because its
probability is 0.56. Since $a_3$ and $a_4$ have become
quasi-SLCA results, the bounds of $a_1$ can be updated
as (LB=0.890, UB=0.950) → (LB=0.136, UB=0.196). As
such, $a_1$ can be pruned because its upper bound is
lower than 0.40. From this example, we can see that
many answers can be pruned or returned without
the need to know their accurate probabilities, and that
pruning becomes considerably more effective as the
users' search confidence (i.e., the threshold) increases.

An attentive reader may notice that we have to
compute the probability of $a_4$ being a quasi-SLCA
because we cannot determine whether or not $a_4$ is
a qualified result to be output based only on its
lower/upper bound values. To exactly calculate the
probability of $a_4$ being a quasi-SLCA, we have to
access its child/descendant nodes, e.g., $c_1$, $c_2$, $c_3$,
although $c_2$ has been recognized as a pruned node
before we start to process $a_4$. If an internal node
depends on a larger number of pruned nodes, the
effectiveness of pruning will be degraded to some
extent. To address this challenging problem, we will
introduce a Probability Density Function (PDF) that can be
used to approximately compute the probability of a
node, the result of which can be used to further update the
lower bound and upper bound of its ancestor nodes.
The details are provided and discussed with the
algorithms later.

4 Probabilistic Inverted Index 

In this section, we describe our Probabilistic Inverted
(PI) index structure for efficiently evaluating PrTKQ
queries over probabilistic XML data. In keyword
search on certain XML data, inverted indexes are
popular structures, e.g., [16], [20]. The basic technique
is to maintain a list of lists, where each element in
the outer list corresponds to a domain element (i.e.,
a keyword). Each inner list stores the ids of XML
nodes in which the given keyword occurs and, for
each node, the frequency or the weight with which the
keyword appears. In this work, we introduce
a probabilistic version of this structure, in which we
store for each keyword a list of node-ids. Along with
each node-id, we store the probability values that the
subtree rooted at the node may contain the given
keyword. The probability values in the inner lists can be
used to compute the lower bound and upper bound
on-the-fly during PrTKQ evaluation.

Figure 4 shows an example of a probabilistic inverted
index of the data in Figure 1. At the base of
the structure is a list of keywords storing pointers to
lists, corresponding to each term in the XML data T.
This is an inverted array storing, for each term in T, a
pointer to a list of triples. In the list $k_i$-list corresponding
to $k_i \in T$, the triple $(v_{id}, Pr(path_{r \to v}), \{p_1, \ldots\})$
records the node v [1], the conditional probability
from the root to v, and the probability set, which may
contain a single probability value or multiple probability
values. A single probability value represents that all the
keyword instances in the subtree can be considered
as independent in probability, e.g., the confidence of
$a_2$ containing $k_1$ is {0.65}, while multiple probability
values mean that the keyword instances belonging to
different sets are mutually exclusive, e.g., the confidence of $a_3$
containing $k_1$ is a set {0.8, 0.86, 0.82}, which represents
the different possibilities of $k_1$ occurring in $a_3$.
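
As an illustration of the structure just described, the sketch below keeps one posting list per keyword, where each posting is the triple (node id, path probability, probability set). It is an assumption-laden sketch rather than the paper's implementation: the names `Posting`, `build_index` and `candidate_nodes` are hypothetical, and the sample entries are the ones reported for a_3 and a_4 in Examples 5 and 6.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Posting(NamedTuple):
    node_id: str                     # node name or Dewey code
    path_prob: float                 # Pr(path from root to node)
    part_probs: Tuple[float, ...]    # one value (IND) or one value per MUX part


def build_index(entries):
    """entries: iterable of (keyword, node_id, path_prob, part_probs)."""
    index = defaultdict(list)
    for keyword, node_id, path_prob, part_probs in entries:
        index[keyword].append(Posting(node_id, path_prob, tuple(part_probs)))
    return index


def candidate_nodes(index, query):
    """Node ids that occur in the posting list of every query keyword."""
    id_sets = [{p.node_id for p in index.get(k, [])} for k in query]
    return set.intersection(*id_sets) if id_sets else set()


# Sample entries taken from Examples 5 and 6.
ENTRIES = [
    ("k1", "a3", 0.8, (0.8, 0.86, 0.82)),
    ("k2", "a3", 0.8, (0.5, 0.3, 0.0)),
    ("k1", "a4", 1.0, (0.65,)),
    ("k2", "a4", 1.0, (0.58,)),
]
```

Calling `candidate_nodes(build_index(ENTRIES), ["k1", "k2"])` returns the nodes whose subtrees may contain both keywords; only these need bounds computed at query time.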

4.1 Basic Operations of Building PI Index 

To build the PI index, we need to traverse the given XML
data tree once in a bottom-up manner. During the
data traversal, we will apply the following operations,
which may be used alone or in combination. The
binary operation $X \bowtie^{/} Y$ promotes the probability
of Y to its parent node X. The binary operation
$X \bowtie^{sibling} Y$ promotes the probabilities of two sibling
nodes X and Y to their parent node. The n-ary case can
be processed by calling the corresponding binary
cases one by one.

Assume $v_1$ contains the keywords $\{k_1, k_2, \ldots, k_i, \ldots, k_m\}$
and the conditional probability $Pr(path_{v_p \to v_1})$
is $\lambda_1$; and $v_2$ contains the keywords $\{k_i, k_{i+1}, \ldots, k_{m'}\}$
and the conditional probability $Pr(path_{v_p \to v_2})$ is $\lambda_2$.

Operator1 – $v_1 \bowtie^{sibling,IND} v_2$: If $v_1$ and $v_2$ are independent
sibling nodes, we can directly promote their
probabilities to their parent $v_p$, and we have

$$Pr(k_j, v_p) = \begin{cases}
\lambda_1 \cdot Pr(k_j, v_1) & j < i; \\
1 - (1 - \lambda_1 \cdot Pr(k_j, v_1)) \cdot (1 - \lambda_2 \cdot Pr(k_j, v_2)) & i \leq j \leq m \leq m'; \\
\lambda_2 \cdot Pr(k_j, v_2) & m < j \leq m'.
\end{cases} \qquad (4)$$

1. The symbol v is used to represent a node's name or a node's
id without confusion in the following sections. Here, v is the id of
the node v.

Fig. 4. A probabilistic Inverted Index



Operator2 – $v_1 \bowtie^{/,IND} v_2$: If $v_2$ is an independent
child of $v_1$, we can directly promote the probability
of $v_2$ to $v_1$, and we have

$$Pr(k_j, v_1) = \begin{cases}
Pr(k_j, v_1) & j < i; \\
1 - (1 - Pr(k_j, v_1)) \cdot (1 - \lambda_2 \cdot Pr(k_j, v_2)) & i \leq j \leq m \leq m'; \\
\lambda_2 \cdot Pr(k_j, v_2) & m < j \leq m'.
\end{cases} \qquad (5)$$

Example 5: Let's show the procedure of computing
$c_1 \bowtie^{sibling,IND} c_2 \bowtie^{sibling,IND} c_3$ in Figure 2 using
Operator1 and Operator2. First, we compute
$c_1 \bowtie^{sibling,IND} c_2$ and promote the probabilities of the keywords
to their parent $a_4$ by Operator1, i.e., $Pr(k_1, a_4)
= 1 - (1 - 0.5 \times 1.0) \times (1 - 0.3 \times 1.0) = 0.65$ and $Pr(k_2, a_4)
= 0.3$. Then, we compute $a_4 \bowtie^{/,IND} c_3$ using
Operator2, i.e., $Pr(k_2, a_4) = 1 - (1 - 0.3) \times (1 - 0.4) =
0.58$, while $Pr(k_1, a_4)$ does not change because $c_3$ only
contains $k_2$ here. The conditional probability from
the root to $a_4$ is 1.0. Therefore, $k_1 \to (a_4, 1.0, 0.65)$
and $k_2 \to (a_4, 1.0, 0.58)$ will be inserted into the PI index,
respectively.
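
Operator1 and Operator2 can be sketched as dictionary merges over per-keyword probabilities. This is a hedged illustration: a keyword missing from a node is treated as having probability 0, which reproduces the three cases of Equations (4) and (5) with a single formula, and the function names are hypothetical.

```python
def promote_siblings_ind(v1, lam1, v2, lam2):
    """Operator1: combine two independent siblings into their parent.

    v1, v2 map keyword -> local probability; lam1, lam2 are the conditional
    probabilities of the edges from the parent to each sibling.
    """
    parent = {}
    for k in set(v1) | set(v2):
        p1 = lam1 * v1.get(k, 0.0)
        p2 = lam2 * v2.get(k, 0.0)
        # The keyword is absent from the parent only if absent from both sides.
        parent[k] = 1.0 - (1.0 - p1) * (1.0 - p2)
    return parent


def promote_child_ind(v1, v2, lam2):
    """Operator2: fold an independent child v2 (edge probability lam2) into v1."""
    merged = dict(v1)
    for k, p in v2.items():
        merged[k] = 1.0 - (1.0 - merged.get(k, 0.0)) * (1.0 - lam2 * p)
    return merged


# Example 5: c1 holds k1 (edge 0.5), c2 holds k1 and k2 (edge 0.3),
# c3 holds k2 (edge 0.4); each leaf contains its keywords with probability 1.0.
a4 = promote_siblings_ind({"k1": 1.0}, 0.5, {"k1": 1.0, "k2": 1.0}, 0.3)
a4 = promote_child_ind(a4, {"k2": 1.0}, 0.4)
```

After these two calls `a4` holds approximately `{"k1": 0.65, "k2": 0.58}`, the values inserted into the PI index in Example 5.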

Operator3 – $v_1 \bowtie^{sibling,MUX} v_2$: If $v_1$ and $v_2$ are two
mutually exclusive sibling nodes and $v_p$ is their parent,
then we generate two parts in $v_p$ by $v_p \bowtie^{/,IND} v_1$
and $v_p \bowtie^{/,IND} v_2$, respectively.

Operator4 – $v_1 \bowtie^{/,MUX} v_2$: If $v_2$ is a mutually
exclusive child node of $v_1$, then we can get the aggregated
probability by $v_1 \bowtie^{/,IND} v_2$.

In the above four basic operators, we assume the
terms appear independently in $v_1$ and $v_2$. When the
nodes $v_1$ and $v_2$ contain mutually exclusive parts, we
need to deal with each part using the four basic
operators.

Consider two independent sibling nodes $v_1$ ($\lambda_1$) and $v_2$
($\lambda_2$) where only $v_2$ contains a set of mutually exclusive
parts $\{p_{m_1}, p_{m_2}, \ldots\}$ with conditional probability $\lambda_{m_i}$.
In this case, we can apply the operation $v_1 \bowtie^{sibling,IND}
p_{m_i}$ for each part $p_{m_i}$. The computed results are
maintained in different parts in their parent $v_p$.

Example 6: Consider an independent node $c_5$ and a
node $a_5$ consisting of $c_6$, $c_7$ and $c_8$ in Figure 3. We first
promote $c_6$, $c_7$ and $c_8$ to $a_5$, which then consists of three parts.
Next, $c_5 \bowtie^{sibling,IND} a_5$ can be called to compute the probability
with regard to their parent $a_3$. To do this, we
apply $c_5 \bowtie^{sibling,IND} part_i$ for each $part_i \in a_5$ using
Operator1. The results are shown for $a_3$ in Figure 3.
After that, we can insert $k_1 \to (a_3, 0.8, \{0.8, 0.86, 0.82\})$
and $k_2 \to (a_3, 0.8, \{0.5, 0.3, 0\})$ into the PI index.

If both $v_1$ and $v_2$ contain a set of mutually exclusive
parts, then we can do pairwise aggregations
across the two sets of parts. Building the PI index
needs to scan the given probabilistic XML data
only once. Assume that the probabilistic XML has
been encoded using probabilistic Dewey codes. The
basic idea of building the PI index is to progressively
process the document nodes sorted by Dewey codes
in ascending order, i.e., the data can be loaded and
processed in a streaming strategy. When a leaf node
$v_1$ arrives, we compute the probability of
each term in the leaf node $v_1$. After that, the terms
with their probabilities in $v_1$ are written into the PI
index. Next, we need to promote the terms and their
probabilities of $v_1$ to the parent $v_p$ of $v_1$ based on
the operation types in Section 4.1. After the node
stream is scanned completely, the building algorithm
of the PI index terminates. We do not provide the
detailed building algorithm in this paper.

4.2 Pruning Techniques using PI Index 

In this subsection, we first show how to prune the
unqualified nodes using the proposed lower/upper
bounds. Then, we explain how to compute the
lower/upper bounds, and how to update the lower/upper
bounds based on intermediate results during
the query evaluation.

By default, the node lists in the PI index are sorted in
the document order. $Pr(k_i, v)$ represents the overall
probability of a keyword $k_i$ in a node v. It is obvious
that the overall probability of a keyword appearing in
a node is larger than or equal to that of the keyword
appearing in its descendant nodes. The overall
probability value for each keyword in a node can be
computed and stored in the PI index offline.

Consider a node v and a PrTKQ q containing a set
of keywords $\{k_1, k_2, \ldots, k_t\}$. If all the terms in v are
independent, then we have

$$LB(q, v) = \prod_{i=1}^{t} Pr(k_i, v) \qquad (6)$$

$$UB(q, v) = \min\{ Pr(k_i, v) \mid 1 \leq i \leq t \} \qquad (7)$$

Most of the time, v consists of a set of parts
$\{vp_1, vp_2, \ldots, vp_m\}$ that are mutually exclusive. In this
case, the lower bound of v would be generated from
a part $vp_j$ that gives the highest lower bound value
while the upper bound of v would be generated from
another part $vp_i$ that gives the highest upper bound
value, in which j may or may not be equal to i.

$$LB(q, v) = \max_{1 \leq j \leq m} \Big\{ \prod_{i=1}^{t} Pr(k_i, vp_j) \Big\} \qquad (8)$$

$$UB(q, v) = \max_{1 \leq j \leq m} \big\{ \min\{ Pr(k_i, vp_j) \mid 1 \leq i \leq t \} \big\} \qquad (9)$$

where $vp_j$ must satisfy the criteria: (1) $LB(q, v) > 0$;
(2) we cannot find another part $vp'_j$ having
$\prod_{i=1}^{t} Pr(k_i, vp'_j) > \prod_{i=1}^{t} Pr(k_i, vp_j)$ and
$\min\{Pr(k_i, vp'_j) \mid 1 \leq i \leq t\} > \min\{Pr(k_i, vp_j) \mid 1 \leq i \leq t\}$. Otherwise,
$UB(q, v)$ and $LB(q, v)$ will be set as zero.

Example 7: Let's consider $a_3$ in Figure 3 as
an example. The first and second parts can
generate lower and upper bounds: Part 1 →
$LB(\{k_1, k_2\}, a_3) = 0.32$, $UB(\{k_1, k_2\}, a_3) = 0.4$; and Part
2 → $LB(\{k_1, k_2\}, a_3) = 0.206$, $UB(\{k_1, k_2\}, a_3) = 0.24$. Because
Part 1 can produce a higher upper bound than
Part 2, the lower and upper bounds of $a_3$ will come
from Part 1, which guarantees that $a_3$ can be a quasi-SLCA
candidate with a higher probability. Since Part
3 does not contain all the keywords, i.e., it is missing $k_2$, it
cannot generate lower and upper bounds.
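
The bound computation of Equations (6) to (9) can be sketched as follows. Two points are assumptions made for this illustration: the local bounds are scaled by the path probability to obtain global values (which is what reproduces the numbers in Example 7, since a_3 has path probability 0.8), and the extra dominance criterion on parts is simplified to taking the maximum over parts that contain every keyword. The function name `bounds` is hypothetical.

```python
from math import prod


def bounds(parts, query, path_prob=1.0):
    """Return (LB, UB) for a node whose term probabilities are split into
    mutually exclusive parts; each part maps keyword -> local probability."""
    lower, upper = [], []
    for part in parts:
        probs = [part.get(k, 0.0) for k in query]
        if min(probs) <= 0.0:
            continue  # this part misses a query keyword and yields no bounds
        lower.append(path_prob * prod(probs))   # keywords fully independent
        upper.append(path_prob * min(probs))    # keywords fully co-occurring
    if not lower:
        return 0.0, 0.0
    return max(lower), max(upper)


# a3 from Example 7: three parts, path probability 0.8.
A3_PARTS = [
    {"k1": 0.8, "k2": 0.5},
    {"k1": 0.86, "k2": 0.3},
    {"k1": 0.82, "k2": 0.0},
]
lb_a3, ub_a3 = bounds(A3_PARTS, ["k1", "k2"], path_prob=0.8)
```

Here `lb_a3` and `ub_a3` come out at about 0.32 and 0.4, both from Part 1, while a single-part node such as a_2 with probabilities 0.65 and 0.916 gives roughly 0.595 and 0.65.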

Property 1: [Upper Bound Usage] A node v can
be filtered if the overall probability $Pr(k_i, v)$ of any
keyword $k_i$ ($k_i \in v$ and $k_i \in q$) is lower than the
given threshold value σ, i.e., $\exists k_i, Pr(k_i, v) < \sigma$.

Proof: Since $Pr^{G}_{quasi-slca}(q, v) \leq
\min\{Pr(k_i, v) \mid k_i \in q\} \leq Pr(k_i, v)$ for every $k_i \in q$, we
have $\min\{Pr(k_i, v) \mid k_i \in q\}$ as the upper bound
probability of v becoming a qualified quasi-SLCA
node. Therefore, if a node v satisfies the inequality
$Pr(k_i, v) < \sigma$, then $Pr^{G}_{quasi-slca}(q, v)$ must be lower
than σ. As such, v can be filtered. □

Property 2: [Lower Bound Usage] A node v can
be returned as a required result if we have $LB(q, v) \geq
\sigma$ and $UB(q, v_d) < \sigma$, where $v_d$ is any child or descendant
node of v.

Proof: $UB(q, v_d) < \sigma$ means that all the keyword
nodes in the subtree rooted at v will contribute their
probabilities to node v. In other words, no descendant
node of v could be a quasi-SLCA, so the lower bound
probability $LB(q, v)$ will not be reduced. Therefore, if
we have $LB(q, v) \geq \sigma$, then $Pr^{G}_{quasi-slca}(q, v) \geq \sigma$.
As such, v can be returned as a quasi-SLCA result. □

Example 8: Let's continue Example 7. $a_3$ can be directly returned as a qualified answer for the given threshold $\sigma$ (= 0.4). This is because its descendant nodes ($c_2$, $c_j$, ...) are filtered, since their upper bounds are less than the threshold $\sigma$ (= 0.4).

To update the lower/upper bound values during query evaluation, one option is to treat the different types of nodes differently, which may give the updated lower/upper bounds better precision. The disadvantage of this option is that it can easily hurt the efficiency of the bound update: given a current node having multiple quasi-SLCA nodes as its descendants, we would need to know the detailed relationships (IND or MUX) among those quasi-SLCA nodes. To avoid this disadvantage, we do not distinguish the different types of distributional nodes under which the multiple quasi-SLCA nodes appear. Instead, we unify them into a single formula based on the following two properties.

Property 3: No matter whether node $v$ is an IND, ordinary or MUX node, we can update its upper bound value as follows:

$$UB'(q,v) = UB(q,v) - 1 + \prod_{i=1}^{m} \big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big) \tag{10}$$

where $Pr^{G}_{quasi\text{-}slca}(q, v_{c_i}) \ge \sigma$ should hold.

Proof: According to the definition of the upper bound, $UB(q,v)$ represents the maximal probability of $v$ being a quasi-SLCA node, which comes from the overall probability of a specific keyword. Therefore, the problem of updating the upper bound can alternatively be considered as determining what percentage of the probability of the keyword has been used for $v$'s descendant nodes becoming qualified quasi-SLCA nodes. If we know there are $m$ qualified descendant nodes of $v$ as returned answers, then we can compute their aggregated probability by $1 - \prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big)$. Therefore, the upper bound can be updated as $UB(q,v) - 1 + \prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big)$.

Does the above update equation hold for a MUX node? To answer this question, we utilize the properties in [23], from which we can compute the aggregated probability by using $\sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})$. Therefore, we have $UB'(q,v) = UB(q,v) - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})$. The equation can be converted into $UB(q,v) - 1 + \big[1 - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big]$.

Since $\prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big)$ can be expressed as $1 - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i}) + \Delta$, where $\Delta$ is a non-negative value, i.e., $\Delta \ge 0$, we can derive that $1 - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i}) \le \prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big)$. As a consequence, we can obtain that

$$UB'(q,v) = UB(q,v) - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i}) = UB(q,v) - 1 + \Big[1 - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\Big] \le UB(q,v) - 1 + \prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big).$$

Therefore, $UB'(q,v) = UB(q,v) - 1 + \prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big)$ holds for IND, ordinary and MUX nodes. □
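As a small worked case of the inequality used in this proof, take $m = 2$ and write $p_1$ and $p_2$ (shorthand introduced here for illustration only) for the probabilities of the two qualified descendant nodes. Expanding the product makes the non-negative term $\Delta$ explicit:

```latex
\begin{align*}
(1 - p_1)(1 - p_2) &= 1 - (p_1 + p_2) + p_1 p_2, \\
\Delta &= p_1 p_2 \ge 0 \quad \text{since } p_1, p_2 \in [0, 1], \\
\text{hence}\quad 1 - (p_1 + p_2) &\le (1 - p_1)(1 - p_2).
\end{align*}
```

The product form therefore never yields a smaller updated upper bound than the sum form, which is why it remains a valid upper bound for MUX nodes as well.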

Property 4: No matter whether node $v$ is an IND, ordinary or MUX node, we can update its lower bound value as follows:

$$LB'(q,v) = LB(q,v) - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i}) \tag{11}$$

where $Pr^{G}_{quasi\text{-}slca}(q, v_{c_i}) \ge \sigma$ should hold.

Proof: For the lower bound update, we need to deduct the confirmed probability $\big[1 - \prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big)\big]$ for IND nodes, or $\sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})$ for MUX nodes, from the original lower bound $LB(q,v)$. According to the procedure of the above proof, we have $\prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big) \ge 1 - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})$. Consequently, we have the inequality

$$1 - \prod_{i=1}^{m}\big(1 - Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\big) \le 1 - \Big(1 - \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i})\Big) = \sum_{i=1}^{m} Pr^{G}_{quasi\text{-}slca}(q, v_{c_i}).$$

Therefore, it is safe to use the right side to update the lower bound values. □

Example 9: Consider $a_4$, which has been computed and whose probability is 0.44. Given the threshold $\sigma$ (= 0.4), $a_4$ is returned as a quasi-SLCA result. Consequently, we can update the lower/upper bound values of its ancestor $a_2$, i.e., $UB'(\{k_1,k_2\}, a_2) = 0.65 - 1 + (1 - 0.44) = 0.21$ and $LB'(\{k_1,k_2\}, a_2) = 0.595 - 0.44 = 0.155$. Since $UB'(\{k_1,k_2\}, a_2) < \sigma$, $a_2$ can be filtered out effectively without computation.
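The two update rules can be written as one small function. The sketch below is illustrative only; the name `update_bounds` and its argument layout are assumptions, and the formulas are Equation 10 (upper bound) and Equation 11 (lower bound). The usage lines plug in the numbers of Example 9.

```python
from math import prod

def update_bounds(lb, ub, answer_probs):
    """Apply Equations 10 and 11 for the qualified descendant answers.

    answer_probs: probabilities of the descendant nodes already returned
    as quasi-SLCA answers (each assumed to be >= the threshold).
    """
    new_ub = ub - 1.0 + prod(1.0 - p for p in answer_probs)  # Equation 10
    new_lb = lb - sum(answer_probs)                          # Equation 11
    return new_lb, new_ub

# Example 9: ancestor a_2 with LB = 0.595, UB = 0.65, and answer a_4 (0.44).
new_lb, new_ub = update_bounds(0.595, 0.65, [0.44])
print(round(new_lb, 3), round(new_ub, 3))
```

With an empty answer list the product is 1 and the sum is 0, so both bounds stay unchanged.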

Property 3 is used to filter the unqualified nodes by 
reducing the upper bound value while Property 4 is 
used to quickly find the qualified required results by 
comparing the reduced lower bound value (for the 
probability of the remaining quasi-SLCAs) with the 
threshold value. 

Sometimes, we need to calculate the probability distributions of keywords in a node if the given threshold $\sigma$ is in the range $(LB(q,v), UB(q,v)]$. The basic computational procedure is similar to the PrStack algorithm in [23]. Different from the PrStack algorithm, we introduce a probability density function (PDF) to approximately calculate the probability for a node if the node depends on a large number of pruned descendant nodes. To decide when to invoke the PDF while avoiding a significant loss of precision, we select and exactly compute some descendant nodes that may contribute large probabilities to the node $v$. For the remaining descendant nodes, we may choose to invoke the PDF, which reduces the time cost while still preserving precision to some extent. The detailed procedure is introduced in the next section.



5 Prune-Based Probabilistic Threshold Keyword Query Algorithm

A key challenge of answering a PrTKQ is to identify the qualified result candidates and filter the unqualified ones as soon as possible. In this work, we address this challenge with the help of our proposed probabilistic inverted (PI) index. Two efficient algorithms are proposed: a comparable Baseline Algorithm and a PI-based Algorithm.

5.1 Baseline Algorithm 

In keyword search on certain XML data, it is popular 
to use keyword inverted index retrieving the rele- 
vant keyword nodes, by which the keyword search 
results are generated based on different algorithms, 
e.g., 123, COO, El, ED In probabilistic XML data, 
[23] proposed the PrStack Algorithm to compute top-$k$ SLCA nodes. In this section, we propose an effective Baseline Algorithm that is similar to the idea of the PrStack Algorithm. To answer a PrTKQ, we need to scan all the keyword inverted lists once. First, the keyword-matched nodes are read one by one in document order. After one node is processed, we check whether its probability reaches the given threshold value $\sigma$. If so, the node is output as a quasi-SLCA node and its remaining keyword distributions (i.e., those containing only partial query keywords) are promoted to its parent node. Otherwise, we promote its complete keyword distributions (i.e., those containing either all keywords or partial keywords) to its parent node. After that, the node at the top of the stack is popped. This procedure is repeated until all nodes are processed, at which point the algorithm terminates. The detailed procedure is shown in Algorithm 1.
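The stack discipline behind this document-order scan can be illustrated without any probability computation. The following sketch assumes that Dewey codes are represented as Python tuples and that the stack holds the path from the root to the current node; the function name `pop_order` is hypothetical, and the keyword distribution handling (CombineProb) is deliberately left out. It only shows the order in which nodes are finished, i.e., the order in which the algorithm would test them against the threshold and promote their distributions.

```python
def pop_order(deweys):
    """Return nodes in the order a stack-based bottom-up scan finishes them."""
    stack, finished = [], []
    for v in sorted(deweys):  # tuple order on Dewey codes is document order
        # Pop every stacked node that is not an ancestor-or-self of v.
        while stack and stack[-1] != v[:len(stack[-1])]:
            finished.append(stack.pop())
        # Push the missing ancestors of v, then v itself.
        for depth in range(len(stack) + 1, len(v) + 1):
            stack.append(v[:depth])
    # The input is exhausted: finish whatever is still on the stack.
    while stack:
        finished.append(stack.pop())
    return finished

for node in pop_order([(1, 2), (1, 1, 1), (1, 1, 2)]):
    print(node)
```

Every node is finished only after all of its descendants, which is what allows the remaining keyword distributions of a child to be promoted to its parent before the parent is evaluated.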

Because the Baseline Algorithm only needs to scan the keyword node lists once, it is a fast and simple algorithm. However, its core computation, the keyword distribution computation, consumes a lot of time. This motivates us to propose the PI-based Algorithm, which can quickly identify the qualified or unqualified candidates using the offline-computed PI index and only computes keyword distributions for a few candidates. Here, the Baseline Algorithm is taken as a comparison base to show the pruning performance of the PI-based Algorithm described below.

5.2 PI-based Algorithm

To efficiently answer PrTKQ, the basic idea of PI- 
based Algorithm is to read the nodes from keyword 
node lists one by one in a bottom-up strategy. For 
each node, we quickly compute its lower bound and 
upper bound by accessing PI index, which is far faster 
than computing the keyword distributions of the node 
directly. After comparing its lower/upper bounds 
with the given threshold value, we can decide if the node should be output as a qualified answer, skipped as an unqualified result, or cached as a potential result candidate. For example, if the current node's lower bound is larger than or equal to the threshold value, then the node can be output directly without further computation. This is because all its descendants have been checked according to the bottom-up strategy. If its upper bound is lower than the threshold value, then the node can be filtered out. Otherwise, it will be temporarily cached for further checking. Different operations are applied in the different cases. Only the nodes identified as potential result candidates need to be computed. Compared with the Baseline Algorithm, the PI-based Algorithm can be accelerated significantly because the Baseline Algorithm has to compute the keyword distributions for all nodes. The detailed procedure is shown in Algorithm 2.

Algorithm 1 Baseline Algorithm

input: a query $q = \{k_1, k_2, \ldots, k_m\}$ with threshold $\sigma$, keyword inverted (KI) index
output: a set $R$ of quasi-SLCA nodes
load keyword node lists $L = \{l_1, l_2, \ldots, l_m\}$ from KI index;
get the smallest Dewey $v$ from $L$;
initiate a stack $S_1$ using $v$;
while $L \ne \emptyset$ do
    get the next smallest Dewey $v$ from $L$;
    while ($S_1$.top() $\nprec v$) do
        $x$ = $S_1$.pop();
        if $x$ contains full keywords and $Pr^{G}_{quasi\text{-}slca}(x) \ge \sigma$ then
            output $x$ into $R$;
        promote the rest keyword distributions of $x$ to its parent $x_p$ using CombineProb($x$, $x_p$);
    $S_1$.push($v$);
while $S_1 \ne \emptyset$ do
    a new node $v \leftarrow S_1$.pop();
    if $v$ contains full keywords and $Pr^{G}_{quasi\text{-}slca}(v) \ge \sigma$ then
        output $v$ into $R$;
    promote the rest keyword distributions of $v$ to its parent $v_p$ using CombineProb($v$, $v_p$);
return $R$;

Algorithm 2 PI-based Algorithm

input: a query $q = \{k_1, k_2, \ldots, k_m\}$ with threshold $\sigma$, keyword inverted (KI) index, PI index
output: a set $R$ of quasi-SLCA nodes
1: load keyword node lists $L = \{l_1, l_2, \ldots, l_m\}$ from KI index;
2: load probability node lists $PIL = \{PIL_1, PIL_2, \ldots, PIL_m\}$;
3: get the smallest Dewey $v$ from $L$;
4: initiate a stack $S_1$ using $v$ and an empty stack $S_2$;
5: while $L \ne \emptyset$ do
6:     get the next smallest Dewey $v$ from $L$ again;
7:     while ($S_1$.top() $\nprec v$) do
8:         $x$ = $S_1$.pop();
9:         $UB(q,x)$ and $LB(q,x)$ $\leftarrow$ ComputeBound($x$, $\{PIL_i(x)\}$);
10:        if $LB(q,x) \ge \sigma$ then
11:            output $x$ into $R$;
12:            UpdateBound($\{v_a \in S_1 \mid v_a \prec x\}$, $LB(q,x)$, $UB(q,x)$);
13:            $S_2$.pop($v_d \in S_2 \mid v_d \succ x$);
14:        else if $UB(q,x) \ge \sigma > LB(q,x)$ then
15:            Prob($x$) $\leftarrow$ ComputeProbDist($x$, $S_2$);
16:            if Prob($x$) $\ge \sigma$ then
17:                output $x$ into $R$;
18:                UpdateBound($\{v_a \in S_1 \mid v_a \prec x\}$, Prob($x$));
19:                $S_2$.pop($v_d \in S_2 \mid v_d \succ x$);
20:        else
21:            $S_2$.push($x$);
22:    $S_1$.push($v$);
23: while $S_1 \ne \emptyset$ do
24:     a new node $v \leftarrow S_1$.pop();
25:     $UB(q,v)$ and $LB(q,v)$ $\leftarrow$ ComputeBound($v$, $\{PIL_i(v)\}$);
26:     process the node $v$ using the same code as in Line 10 - Line 21;
27: return $R$;

5.2.1 Detailed Procedure of PI-based Algorithm

In Algorithm 2, Line 1 to Line 4 show the procedure of initiating the PI-based Algorithm. We first load the keyword node lists $L$ from the KI index and the probability node lists $PIL$ from the PI index. Then we take the smallest node $v$ from $L$ to initiate a stack $S_1$, which is set using the Dewey code of $v$. Another stack $S_2$ is used to maintain the temporarily filtered nodes. After that, the PI-based Algorithm is ready to start.

Next, we need to check each node in $L$ in document order. Different from the Baseline Algorithm, we only compute the keyword distribution probabilities for a few nodes that are first identified using the lower bound and upper bound in $PIL$. Let $v$ be the



next smallest node to be processed. We compare it with the node $x$ at the top of stack $S_1$. If $v$ is a descendant node of $x$, then $v$ is pushed into $S_1$ and we get the next smallest node from $L$. Otherwise, we pop $x$ from $S_1$ and check whether it is a qualified quasi-SLCA answer. The Baseline Algorithm would compute the keyword distributions of $x$ and combine its remaining distributions with the distribution of its parent based on promotion operations. In contrast, the PI-based Algorithm quickly computes the upper bound $UB(q,x)$ and the lower bound $LB(q,x)$ using $PIL$, which are used to classify the nodes as qualified nodes (output), unqualified nodes (filter) and uncertain nodes (to be further checked). By doing this, only a few nodes need to be computed. Since bound computation is far faster than the computation of keyword distributions, a lot of run time can be saved in the PI-based Algorithm. Line 10 to Line 21 show the detailed procedure. If the lower bound $LB(q,x)$ is larger than or equal to the given threshold value $\sigma$, then $x$ can be output as a qualified quasi-SLCA answer without computation. At this moment, the lower bound $LB(q,x)$ is taken as the temporary probability of $x$ being a quasi-SLCA result, because the computation of the exact probability of $x$ is delayed until its exact value is needed. Subsequently, the temporary probability value $LB(q,x)$ and the probabilities of $x$'s descendant quasi-SLCA results are used to update the lower/upper bounds of the ancestors of $x$ in stack $S_1$ based on Equation 11 and Equation 10, respectively. If the lower bound $LB(q,x)$ is lower than $\sigma$ while the upper bound $UB(q,x)$ is larger than or equal to $\sigma$, then we need to compute the keyword distributions of $x$ using the cached descendant nodes in $S_2$. Based on the computed probability Prob($x$) of $x$, $x$ is either output as a qualified answer or filtered as an unqualified candidate. If the upper bound $UB(q,x)$ is lower than $\sigma$, then $x$ is pushed into $S_2$ for the possible computation of its ancestors.
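The three-way decision described above depends only on the two bounds and the threshold. As a minimal sketch (the function name `classify` and the string labels are assumptions made for illustration), it can be expressed as:

```python
def classify(lb, ub, sigma):
    """Decide how a popped node is handled, given its bounds and the threshold."""
    if lb >= sigma:
        return "output"   # qualified: returned without computing distributions
    if ub >= sigma:
        return "compute"  # uncertain: keyword distributions must be computed
    return "cache"        # unqualified: pushed into S2 for its ancestors

print(classify(0.5, 0.6, 0.4))
print(classify(0.155, 0.5, 0.4))
print(classify(0.155, 0.21, 0.4))
```

Only nodes falling into the middle case trigger ComputeProbDist, which is where the saving over the Baseline Algorithm comes from.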

There are two main functions in the PI-based Algorithm. The first one is ComputeProbDist($v$, $S_2$) for computing the probability of the full keyword distribution of $v$ using the descendant nodes in $S_2$. The second is UpdateBound($\{v_a \prec v \mid v_a \in S_1\}$, $LB(q,v)$ or Prob($q,v$)) for updating the bounds of the nodes to be processed.

5.2.2 Function ComputeProbDist()

The function ComputeProbDist($v$, $S_2$) can be implemented in two ways: Exact Computation or Approximate Computation.

Exact Computation actually calculates the probability of $v$ being a quasi-SLCA node by scanning all the nodes in the stack $S_2$ that maintains the descendant nodes of $v$. The processing strategy is similar to the Baseline Algorithm in Section 5.1. In other words, it visits the nodes in $S_2$ one by one, computes the local keyword distribution of each node, and then promotes the intermediate results to its parent. After all nodes in $S_2$ are processed, the probability of $v$ is obtained because it aggregates all the probabilities from its descendant nodes.

Approximate Computation approximately calculates the probability of $v$ being a quasi-SLCA node based on a partial set of nodes in the stack $S_2$ that maintains the descendant nodes of $v$. The approximate computation can be made according to different distribution types, e.g., uniform distributions, piecewise polynomials, Poisson distributions, etc. In this work, we consider normal (Gaussian) distributions in more detail.

As we know Gaussian distribution is considered the 
most prominent probability distribution in statistics. 
However, the PDF of Gaussian distribution cannot be 
applied to PrTKQ over probabilistic XML data directly 
due to two main challenges. The first challenge is to 
simulate the continuous distributions using discrete 
distributions based on the real conditions in order to 
reduce the approximate errors as much as possible, 
and the second is to embody the multiple keyword 
variables in the PDF. 



Generally, the probability density function of a Gaussian distribution $N(\mu, \sigma^2)$ with mean $\mu$ and variance $\sigma^2$ is:

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{12}$$

Addressing Challenge 1: The density function has the shape of a bell centered at the mean value $\mu$ with variance $\sigma^2$. By definition, the Gaussian distribution is often used to describe, at least approximately, measurements that tend to cluster around the mean. Therefore, we let the mean $\mu$ be the partially computed probability value of $v$ being a quasi-SLCA node, which guarantees that the real probability value will not be significantly different from the probability base that has already been calculated from the promising descendant nodes. The value of the variance $\sigma^2$ can be chosen from the range [1 − #computed descendant nodes / #total descendant nodes, 1] based on the visited/unvisited descendant nodes in $S_2$. This is because the more descendant nodes are actually computed, the higher the percentage of values that would be drawn within one standard deviation $\sigma$ from the mean. In the extreme case, if all descendant nodes are actually computed, 100% of the values can be drawn within one standard deviation. Therefore, we select and compute a few descendant nodes of $v$ from $S_2$ that can contribute relatively higher probabilities to making $v$ a quasi-SLCA node. In this work, we use a heuristic method to select a few descendant nodes with the higher probabilities of single keywords among the descendant nodes of $v$. We then take the partially computed probability as the base of the probability density function of a Gaussian distribution.

Let $v$ be a node to be evaluated and $UB(q,v)$ be its current upper bound value. We have

$$Pr^{G,Gaussian}_{quasi\text{-}slca}(q,v) = \int_{0}^{UB(q,v)} f(x)\,dx \tag{13}$$

After substituting Equation 12 into Equation 13, we get

$$Pr^{G,Gaussian}_{quasi\text{-}slca}(q,v) = \int_{0}^{UB(q,v)} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx \tag{14}$$

where $\mu$ is the partially computed probability and $\sigma^2$ is set as 1 − #computed descendant nodes / #total descendant nodes.
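Equation 14 is a difference of two values of the Gaussian cumulative distribution function, so it can be evaluated in closed form with the error function. The sketch below uses Python's `math.erf` for this purpose, which is a substitute for whatever numerical routine is actually used to evaluate the integral; the function name `gaussian_estimate`, its parameters, and the handling of the zero-variance case are assumptions made for illustration.

```python
import math

def gaussian_estimate(mu, ub, computed, total):
    """Evaluate Equation 14: the integral of N(mu, var) over [0, ub].

    mu:       partially computed probability (mean of the Gaussian)
    ub:       current upper bound UB(q, v)
    computed: number of descendant nodes computed exactly
    total:    total number of descendant nodes (must be > 0)
    """
    variance = 1.0 - computed / total
    if variance <= 0.0:
        # Assumption: with every descendant computed, treat all the mass
        # as concentrated at mu instead of dividing by zero.
        return 1.0 if 0.0 <= mu <= ub else 0.0
    sd = math.sqrt(variance)

    def cdf(x):
        # Gaussian CDF expressed through the error function.
        return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

    return cdf(ub) - cdf(0.0)

print(gaussian_estimate(mu=0.3, ub=0.6, computed=1, total=2))
```

Computing more descendants exactly shrinks the variance, which concentrates the density around the partially computed probability.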

2. Note that $UB(q,v)$ has been updated if $v$ has descendant nodes that are qualified answers, i.e., the probability contributions of the qualified answers have been subtracted from it.

Addressing Challenge 2: To embody all the keyword variables in the PDF, we introduce the joint/conditional Gaussian distribution based on the work in [26]. Assume a PrTKQ contains two keywords $k_x$ and $k_y$. We have the conditional PDF as follows.

$$f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi}\,\sigma_Y\sqrt{1-\rho^2}} \exp\left(-\frac{\big[(y-\mu_Y) - \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X)\big]^2}{2\sigma_Y^2(1-\rho^2)}\right) \tag{15}$$

Since $f(x,y) = f_X(x) \cdot f_{Y|X}(y|x)$, after substituting Equation 12 into Equation 15 we get

$$f(x,y) = \frac{1}{2\pi\sigma_X\sigma_Y\sqrt{1-\rho^2}} \exp\left(-\frac{(x-\mu_X)^2}{2\sigma_X^2} - \frac{\big[(y-\mu_Y) - \rho\frac{\sigma_Y}{\sigma_X}(x-\mu_X)\big]^2}{2\sigma_Y^2(1-\rho^2)}\right) \tag{16}$$

If we assume that $x$ and $y$ are independent keyword variables, i.e., $\rho = 0$, and assume $\mu_X = \mu_Y = \mu$ and $\sigma_X = \sigma_Y = \sigma$, then we have

$$f(x,y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{(x-\mu)^2 + (y-\mu)^2}{2\sigma^2}} \tag{17}$$

Therefore, Equation 17 can be easily extended to multiple keyword variables that are assumed to be independent. We can compute the probability of $v$ w.r.t. a PrTKQ $\{k_1, k_2, \ldots, k_t\}$:

$$Pr^{G,Gaussian}_{quasi\text{-}slca}(q,v) = \int_{0}^{UB(q,v)} \cdots \int_{0}^{UB(q,v)} \frac{1}{(2\pi\sigma^2)^{t/2}}\, e^{-\frac{(x_1-\mu)^2 + \cdots + (x_t-\mu)^2}{2\sigma^2}}\,dx_1 \cdots dx_t \tag{18}$$

where $\mu$ is the partially computed probability and $\sigma^2$ is set as 1 − #computed descendant nodes / #total descendant nodes.
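Under the independence assumption stated above, the integrand of Equation 18 is a product of $t$ identical univariate densities and the integration domain is a hypercube, so the $t$-fold integral factorizes. Writing $U$ as shorthand (introduced here) for $UB(q,v)$:

```latex
\begin{align*}
\int_{0}^{U} \cdots \int_{0}^{U}
  \prod_{i=1}^{t} \frac{1}{\sqrt{2\pi\sigma^2}}\,
  e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \, dx_1 \cdots dx_t
&= \prod_{i=1}^{t} \int_{0}^{U} \frac{1}{\sqrt{2\pi\sigma^2}}\,
  e^{-\frac{(x_i-\mu)^2}{2\sigma^2}} \, dx_i \\
&= \left( \int_{0}^{U} \frac{1}{\sqrt{2\pi\sigma^2}}\,
  e^{-\frac{(x-\mu)^2}{2\sigma^2}} \, dx \right)^{t}.
\end{align*}
```

In other words, with a shared mean and variance the multi-keyword estimate is the single-variable integral of Equation 14 raised to the power $t$, so only one one-dimensional integral has to be evaluated.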

In the experiments, we call Matlab from Java to calculate Equation 18. The estimated results are used to compare the exact computation with the approximate computation. The results verify the usability of the Gaussian distribution for measuring the probability.

5.2.3 Function UpdateBound()

For each ancestor node $v_a \in S_1$ ($v_a \prec v$), we need to update the upper bounds and lower bounds using Function UpdateBound() based on Equation 10 and Equation 11, respectively. To guarantee the completeness of the answer set, the parameters of the function may differ depending on the case. For example, if $LB(q,v)$ is larger than or equal to the threshold value $\sigma$, as shown in Algorithm 2 Line 12, then the probability value $LB(q,v)$ is used to update the upper bounds of $v$'s ancestors while the probability value $UB(q,v)$ is used to update the lower bounds of $v$'s ancestors. This choice is conservative: subtracting the smaller value $LB(q,v)$ keeps the ancestors' upper bounds valid, and subtracting the larger value $UB(q,v)$ keeps their lower bounds valid. If $LB(q,v)$ is smaller than $\sigma$ and $UB(q,v)$ is larger than or equal to $\sigma$, as shown in Algorithm 2 Line 18, the actual or approximate probability value Prob($v$) computed by Function ComputeProbDist($v$, $S_2$) is used to update both the upper and lower bounds of $v$'s ancestors.

Here, we use two hashmaps to implement Function UpdateBound(). For a node, one hashmap caches the Dewey code of the node as the key and its lower/upper bounds as the value, where the bounds are computed based on the PI index. The other hashmap records the probabilities of the descendants of the node that have been identified as qualified quasi-SLCA answers. When a node arrives, we can quickly get its updated lower/upper bounds based on the two hashmaps.
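The two-hashmap design can be sketched as follows. This is a hedged illustration rather than the implementation used in the experiments: the class name `BoundCache`, its method names, the tuple representation of Dewey codes, and the choice to keep a running sum and running product per ancestor are all assumptions. The running aggregates let the updated bounds of Equations 10 and 11 be read in constant time when a node is popped.

```python
class BoundCache:
    """Two dictionaries standing in for the two hashmaps described above."""

    def __init__(self):
        self.bounds = {}     # Dewey code -> (LB, UB) computed from the PI index
        self.confirmed = {}  # Dewey code -> (sum of p, product of (1 - p))

    def put(self, dewey, lb, ub):
        self.bounds[dewey] = (lb, ub)

    def record_answer(self, dewey, prob):
        """Credit a qualified answer to every proper ancestor of `dewey`."""
        for depth in range(1, len(dewey)):
            ancestor = dewey[:depth]
            total, remaining = self.confirmed.get(ancestor, (0.0, 1.0))
            self.confirmed[ancestor] = (total + prob, remaining * (1.0 - prob))

    def updated_bounds(self, dewey):
        """Return (LB', UB') following Equations 11 and 10."""
        lb, ub = self.bounds[dewey]
        total, remaining = self.confirmed.get(dewey, (0.0, 1.0))
        return lb - total, ub - 1.0 + remaining

# Hypothetical Dewey codes: a_2 = (1, 2) and its descendant a_4 = (1, 2, 1).
cache = BoundCache()
cache.put((1, 2), 0.595, 0.65)
cache.record_answer((1, 2, 1), 0.44)
lb, ub = cache.updated_bounds((1, 2))
print(round(lb, 3), round(ub, 3))
```

The usage lines reuse the numbers of Example 9, so the printed bounds can be compared against that example.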

6 Experimental Studies 

We conduct extensive experiments to test the performance of our algorithms: the Baseline Algorithm (BA); the PI-based Exact-computation Algorithm (PIEA), which implements Function ComputeProbDist() by exactly computing the probability distributions of the keyword-matched nodes; and the PI-based Approximate-computation Algorithm (PIAA), which makes an approximate computation based on the Gaussian distribution of keywords while still exactly computing the probability distributions of the keyword-matched nodes that have the higher probabilities. All these algorithms were implemented in Java and run on a 3.0GHz Intel Pentium 4 machine with 2GB RAM running Windows 7.

6.1 Dataset and Queries 

We use two real datasets, DBLP [27] and Mondial [28], and a synthetic XML benchmark dataset, XMark [29], for testing the proposed algorithms. For XMark, we also generate four datasets with different sizes. The three types of datasets are selected based on their features. DBLP is a relatively shallow dataset of large size; Mondial is a deep and complex, but small dataset; XMark is a balanced dataset with varied depth, complex structure and varied size. Therefore, they are chosen as test datasets.

For each XML dataset used, we generate the corresponding probabilistic XML tree, using the same method as used in [12]. We visit the nodes in the original XML tree in pre-order. We first set the random ratio of IND : MUX : Ordinary as 3:3:4. For each node $v$ visited, we randomly generate some distributional nodes of type "IND" or "MUX" as children of $v$. Then, for the original children of $v$, we choose some of them as the children of the newly generated distributional nodes and assign random probability distributions to these children, with the restriction that their sum for a MUX node is no greater than 1. The generated datasets are described in Table 1. We then select terms and construct a set of keyword queries to be tested for each dataset. Due to the limited space, we only show six of these queries for each dataset. For each set of queries, the terms in the first two queries match a small number of keyword nodes; the terms of the middle two queries match a medium number of keyword nodes; and the terms of the last two queries match a larger number of keyword nodes.
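The probability assignment step of the dataset generation can be illustrated with a short sketch. It is not the generator of [12]: the function name `assign_probabilities`, the use of uniform random draws, and the rescaling used to enforce the MUX restriction are assumptions made purely for illustration.

```python
import random

def assign_probabilities(kind, n_children, rng):
    """Draw one probability per child of a distributional node.

    kind: "IND" or "MUX". For a MUX node the probabilities are rescaled,
    when necessary, so that their sum is no greater than 1.
    """
    probs = [rng.random() for _ in range(n_children)]
    if kind == "MUX":
        total = sum(probs)
        if total > 1.0:
            probs = [p / total for p in probs]
    return probs

rng = random.Random(7)  # fixed seed so that a run is repeatable
print(assign_probabilities("IND", 3, rng))
print(assign_probabilities("MUX", 4, rng))
```

IND children keep independent probabilities in [0, 1), while MUX children are mutually exclusive, which is why only their sum needs to be bounded.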
TABLE 1
Properties of PrXML data

ID     name      size     #IND   #MUX   #Ordinary
Doc1   XMark     10M      26k    26k    145k
Doc2   XMark     20M      54k    52k    200k
Doc3   XMark     40M      98k    100k   606k
Doc4   XMark     80M      329k   368k   1M
Doc5   Mondial   1.2M     8k     9k     20k
Doc6   DBLP      136M     759k   589k   3M
Doc7   INEX      5,898M   13M    10M    52M

6.2 Varying Keyword Queries 
Fig. 5. Evaluation of Keyword Queries over XMark20M, Mondial, DBLP, INEX where σ=0.3

Figure 5 shows the experimental results when we run the 18 queries over the selected three datasets, where X represents the queries over the 20MB XMark dataset, M represents the queries over the Mondial dataset, D represents the queries over the DBLP dataset and I represents the queries over the INEX dataset. The required threshold value $\sigma$ is set to 0.3. From the results, we find that, compared with the BA algorithm, the PIEA algorithm can most of the time reduce the response time by nearly 40% using the pruning techniques based on the updated lower/upper bounds. The PIAA algorithm can further improve the time efficiency by about 20% under the assumed probability distribution of keywords. For X1, X2 and M1, M2, the response time of the BA algorithm approaches the time cost of the other two algorithms. In particular, for query X2, the PIEA algorithm is outperformed by the BA algorithm. This is because both the number of keyword-matched nodes and the size of the answer sets are smaller than for the other queries. From the four figures on the left side of Figure 5, we find that the scalability of the PIEA and PIAA algorithms is much better than that of the BA algorithm when testing the queries with different sizes of answer sets.

To measure the precision and recall of the PIAA algorithm, we use the precision and recall (P&R) equations from the information retrieval area:

$$Precision = \frac{|R_{BA} \cap R_{PIAA}|}{|R_{PIAA}|}, \qquad Recall = \frac{|R_{BA} \cap R_{PIAA}|}{|R_{BA}|}$$
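With the answer sets of the exact baseline and of the approximate algorithm available as sets of node identifiers, both measures reduce to set operations. The sketch below is illustrative; the function name `precision_recall`, the example node labels, and the convention of returning 0.0 for an empty denominator are assumptions.

```python
def precision_recall(r_ba, r_piaa):
    """Precision and recall of the PIAA answers against the BA answers."""
    common = len(r_ba & r_piaa)
    precision = common / len(r_piaa) if r_piaa else 0.0  # assumed convention
    recall = common / len(r_ba) if r_ba else 0.0         # assumed convention
    return precision, recall

# Hypothetical answer sets identified by node labels.
exact_answers = {"a1", "a2", "a3", "a4"}
approx_answers = {"a1", "a2", "a5"}
print(precision_recall(exact_answers, approx_answers))
```

Precision penalizes answers that PIAA returns but BA does not, while recall penalizes answers of BA that PIAA misses.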

Because the PIEA algorithm finds the same results as the BA algorithm by exactly computing the required probability distributions, Figure 5 only demonstrates the precision and recall of the PIAA algorithm for different queries over each dataset. From the experimental results, we find that the precision and recall reach at least 0.7 for XMark, 0.6 for Mondial, 0.7 for DBLP, and 0.66 for INEX, respectively. Sometimes, they can be as high as 0.9, e.g., for X1, X2, M1, M2, I1, I2, etc. Comparing all the tested queries, we can draw the general conclusion that the precision and recall decrease as the potential result size increases. However, in the experiments, they are never lower than 0.6 because (1) the results with higher probabilities are selected and computed exactly, which does not depend on the Gaussian assumption; (2) the remaining minor results are estimated using the Gaussian assumption over the keyword distributions that have been excluded by the results with higher probabilities. In other words, the PIAA strategy can return a large fraction (>0.6) of the significant results, but may underestimate the minor results.



6.3 Varying Threshold Values

To test the adaptability of the proposed algorithms to threshold queries, we test the changes of response time and precision and recall as the threshold value increases. Figure 6 shows the experimental results when the threshold value varies from 0.2 to 0.7 for queries X5, M5, D5 and I5. The left four figures in Figure 6 show that the PIEA and PIAA algorithms greatly outperform the BA algorithm as the threshold value increases. This is because the BA algorithm has to scan and compute all the relevant nodes, while the PIEA and PIAA algorithms can skip more nodes when the threshold value becomes large. However, once the threshold value reaches 0.5, the change of the time cost becomes smooth because once a quasi-SLCA node is identified, its ancestor nodes can definitely be skipped, which is true for threshold values larger than 0.5. From the right four figures in Figure 6, we find that the precision and recall are affected by the change of threshold values. When the threshold value reaches 0.5, the precision and recall are at least 0.8. On the contrary, if the threshold value is lower than 0.2, the precision and recall decrease to 0.5 on the selected datasets.

Fig. 7. Response Time and F-Measure for different datasets

6.4 Varying Probabilistic Document Size

We first take the XMark dataset as an example to test the performance of the three algorithms when we increase the document size. We test all six queries of the XMark dataset, but in this paper we only show the results of query X3, where the threshold value is specified as 0.3. From Figure 7(a) we can see that the response time of all three algorithms increases when the document size increases from 10MB to 80MB. However, the increase for the PIEA and PIAA algorithms is much slower. In particular, PIAA changes only slightly. The comparison illustrates that the PIEA and PIAA algorithms obtain much better performance than the BA algorithm. In addition, all algorithms show linear degradation, i.e., they have similar scalability.

Second, we evaluate the precision and recall of the PIAA algorithm using a variant of the F-measure that aggregates the precision and recall of all queries together:

$$F\text{-}measure = \frac{2 \cdot \overline{P(q_i)} \cdot \overline{R(q_i)}}{\overline{P(q_i)} + \overline{R(q_i)}}$$

where $\overline{P(q_i)} = \sum_{i} P(q_i) / 6$ and $\overline{R(q_i)} = \sum_{i} R(q_i) / 6$.

To evaluate the F-measure of the PIAA algorithm, we test 24 queries with different threshold values: 0.3, 0.5 and 0.7. From the results in Figure 7(b), we find that the F-measure is over 0.75 for all four datasets.
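The aggregated F-measure used in Section 6.4 first averages the precision and the recall over the queries of a dataset and then combines the two averages. A minimal sketch, in which the function name `f_measure` and the input values are hypothetical:

```python
def f_measure(precisions, recalls):
    """F-measure computed from the per-query precision and recall values."""
    avg_p = sum(precisions) / len(precisions)  # averaged precision
    avg_r = sum(recalls) / len(recalls)        # averaged recall
    if avg_p + avg_r == 0.0:
        return 0.0                             # assumed convention
    return 2.0 * avg_p * avg_r / (avg_p + avg_r)

# Hypothetical per-query values for the six queries of one dataset.
p = [0.9, 0.9, 0.8, 0.8, 0.7, 0.7]
r = [0.9, 0.8, 0.8, 0.7, 0.7, 0.6]
print(round(f_measure(p, r), 3))
```

Averaging before combining gives each query the same weight regardless of the size of its answer set.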



7 Related Work 

The topic of probabilistic XML has been studied re- 
cently. Many models have been proposed, together 
with structured query evaluations. Nierman et al. 13 
first introduced a probabilistic XML model, ProTDB, 
with the probabilistic types IND - independent and 
MUX - mutually-exclusive. Hung et al. modeled 
the probabilistic XML as directed acyclic graphs, sup- 
porting arbitrary distributions over sets of children. 
Keulen et al. [9] used a probabilistic tree approach for data integration where its probability and possibility nodes are similar to MUX and IND, respectively. Cohen et al. [30] incorporated a set of constraints to express more complex dependencies among the probabilistic data. They also proposed efficient algorithms to solve the constraint-satisfaction, query evaluation, and sampling problems under a set of constraints. In [12], Kimelfeld et al. summarized and extended the probabilistic XML models previously proposed; the expressiveness and tractability of queries on different models are discussed with the consideration of IND and MUX. [11] studied the problem of evaluating twig queries over probabilistic XML that may return incomplete or partial answers to users with respect to a probability threshold. [13] proposed and addressed the problem of ranking the top-$k$ probabilities of answers of a twig query. All the above work focused on probabilistic XML data models and/or structured XML queries, e.g., twig queries. The most closely related work is [23], which proposed two algorithms to answer top-$k$ keyword queries over probabilistic XML data. However, compared with [23], in this work we propose a probabilistic inverted index that can be used to efficiently answer threshold keyword queries by reducing the computational cost of unqualified nodes. In addition, we also take into account the relaxation (i.e., quasi-SLCA) of results for keyword search w.r.t. a threshold value, while [23] focused on the strict SLCA semantics of results.

There is some other work that discusses probabilistic indexes for query evaluation and/or data management. Although [31] discussed a probabilistic inverted index as we do, its data model is relational, in which each tuple is associated with a probability value and all tuples are assumed independent. In our work, we built the probabilistic inverted index based on the probabilistic XML data model with IND and MUX semantics. Another difference is that we answer keyword queries while [31] processes equality queries. Another work discussing probabilistic indexes is [32], which first generates possible worlds and then clusters them based on probability values with a limited distance. The problem is that generating all possible worlds is very time-consuming for XML data. In our work, we avoid the generation of possible worlds.

8 Conclusions 

In this work, we first proposed and investigated the problem of finding quasi-SLCA results for PrTKQs over probabilistic XML data. We then designed a PI index and analyzed its pruning features. Based on the lower and upper bounds computed from the PI index, the proposed PI-based algorithm can quickly identify the qualified results and filter the unqualified ones. Our experimental results compared the Baseline Algorithm, the PI-based Exact-computation Algorithm (PIEA) and the PI-based Approximate-computation Algorithm (PIAA), which verified our motivation and approaches.